::p_load(sf, tidyverse, tmap, tidyverse, corrplot, funModeling, corrplot, psych, ggpubr, cluster, factoextra, heatmaply, spdep, ClustGeo, rgdal, NbClust, GGally, patchwork, ggthemes, knitr) pacman
Take-home EX2
Take-home Exercise 2: Regionalisation of Multivariate Water Point Attributes with Non-spatially Constrained and Spatially Constrained Clustering Methods
Setting the Scene
The process of creating regions is called regionalisation. A regionalisation is a special kind of clustering where the objective is to group observations which are similar in their statistical attributes, but also in their spatial location. In this sense, regionalization embeds the same logic as standard clustering techniques, but also applies a series of geographical constraints. Often, these constraints relate to connectivity: two candidates can only be grouped together in the same region if there exists a path from one member to another member that never leaves the region. These paths often model the spatial relationships in the data, such as contiguity or proximity. However, connectivity does not always need to hold for all regions, and in certain contexts it makes sense to relax connectivity or to impose different types of geographic constraints.
Objectives
In this take-home exercise you are required to regionalise Nigeria by using, but not limited to the following measures:
Total number of functional water points
Total number of nonfunctional water points
Percentage of functional water points
Percentage of non-functional water points
Percentage of main water point technology (i.e. Hand Pump)
Percentage of usage capacity (i.e. < 1000, >=1000)
Percentage of rural water points
The Data
Apstial data
For the purpose of this assignment, data from WPdx Global Data Repositories will be used. There are two versions of the data. They are: WPdx-Basic and WPdx+. You are required to use WPdx+ data set.
Geospatial data
Nigeria Level-2 Administrative Boundary (also known as Local Government Area) polygon features GIS data will be used in this take-home exercise. The data can be downloaded either from The Humanitarian Data Exchange portal or geoBoundaries.
The Task
The specific tasks of this take-home exercise are as follows:
Using appropriate sf method, import the shapefile into R and save it in a simple feature data frame format. Note that there are three Projected Coordinate Systems of Nigeria, they are: EPSG: 26391, 26392, and 26303. You can use any one of them.
Using appropriate tidyr and dplyr methods, derive the proportion of functional and non-functional water point at LGA level (i.e. ADM2).
Combining the geospatial and aspatial data frame into simple feature data frame.
Delineating water point measures functional regions by using conventional hierarchical clustering.
Delineating water point measures functional regions by using spatially constrained clustering algorithms.
Thematic Mapping
- Plot to show the water points measures derived by using appropriate statistical graphics and choropleth mapping technique.
Analytical Mapping
- Plot functional regions delineated by using both non-spatially constrained and spatially constrained clustering algorithms.
Glimpse of Steps
Some of the important steps performed in this study are as follows
Importing shapefile into R using sf package.
Deriving the proportion of functional and non-functional water point at LGA level using appropriate tidyr and dplyr methods.
Combining the geospatial and aspatial data frame into simple feature data frame.
Delineating water point measures functional regions by using conventional hierarchical clustering.
Delineating water point measures functional regions by using spatially constrained clustering algorithms.
Thematic Mapping - Plotting maps to show the water points measures derived by using appropriate statistical graphics and choropleth mapping technique.
Analytical Mapping - Plotting functional regions delineated by using both non-spatially constrained and spatially constrained clustering algorithms.
Loading packages
Let us first load required packages into R environment. p_load
function pf pacman package is used to install the following packages into R environment.
sf, rgdal and spdep - Spatial data handling
tidyverse, especially readr, ggplot2 and dplyr - Attribute data handling
tmap - Choropleth mapping
coorplot, ggpubr, and heatmaply - Multivariate data visualisation and analysis
cluster, ClustGeo - Cluster analysis
patchwork, ggthemes - Effective visualisation and layouts
Importing Spatial Data
Import spatial data. Since water point data set is in csv file format, we will use read_csv() of readr package to import WPdx.csv as shown the code chunk below. The output R object is called listings and it is a tibble data frame.
filter()
of dplyr package is used to extract water point records of Nigeria.
<- read_csv("data/WPdx.csv") %>%
wp_nga filter(`#clean_country_name` == "Nigeria")
st_as_sfc()
of sf package is used to derive a new field called Geometry as shown in the code chunk below
$Geometry = st_as_sfc(wp_nga$`New Georeferenced Column`) wp_nga
convert the tibble data frame into sf data frame.
<- st_sf(wp_nga, crs=4326) wp_sf
Importing Geospatial Data
Now let us import geospatial data. The code chunk below uses st_read() function of sf package to import geoBoundaries-NGA-ADM2 shapefile into R environment.
Two arguments are used :
dsn - destination : to define the data path
layer - to provide the shapefile name
crs - coordinate reference system in epsg format. EPSG: 4326 is wgs84 Geographic Coordinate System
<- st_read(dsn = "data", nga layer = "geoBoundaries-NGA-ADM2", crs = 4326) %>% select(shapeName)
Reading layer `geoBoundaries-NGA-ADM2' from data source `C:\HzzZZ11\ISSS624\Take-home_EX\Take-home Ex2\data' using driver `ESRI Shapefile' Simple feature collection with 774 features and 5 fields Geometry type: MULTIPOLYGON Dimension: XY Bounding box: xmin: 2.668534 ymin: 4.273007 xmax: 14.67882 ymax: 13.89442 Geodetic CRS: WGS 84
Checking duplicate area name
duplicated
can retrieve which elements of a vector or data frame are duplicate. The code chunk below can be used to determine the duplicate elements.
<- (nga[order(nga$shapeName), ])
nga <- nga %>%
ngamutate(shapeName = tolower(shapeName))
<- nga$shapeName[ nga$shapeName %in% nga$shapeName[duplicated(nga$shapeName)] ]
duplicate_Name
duplicate_Name
[1] "bassa" "bassa" "ifelodun" "ifelodun" "irepodun" "irepodun"
[7] "nasarawa" "nasarawa" "obi" "obi" "surulere" "surulere"
There are 12 duplicate elements. The below chunk assigns the right shape names corresponding to index values
$shapeName[c(94,95,304,305,355,356,518,519,546,547,693,694)] <- c("Bassa (Kogi)","Bassa (Plateau)",
nga"Ifelodun (Kwara)","Ifelodun (Osun)",
"Irepodun (Kwara)","Irepodun (Osun)",
"Nassarawa (Kano)","Nassarawa",
"Obi (Benue)","Obi(Nasarawa)",
"Surulere (Lagos)","Surulere (Oyo)")
length((nga$shapeName[ nga$shapeName %in% nga$shapeName[duplicated(nga$shapeName)] ]))
[1] 0
No duplicate values since the output is 0.
Combining geospatial and aspatial data
Transfer the attribute information in nga sf data frame into wp_sf data frame using st_join()function.
<- st_join(wp_sf, nga) wp_sf
Renaming the column names
Rename some of the column names which begins with # for ease of use by using rename() function
<- wp_sf %>%
wp_sfT rename ("Country" = "#clean_country_name",
"clean_adm2" = "#clean_adm2",
"status" = "#status_clean",
"lat" = "#lat_deg",
"long" = "#lon_deg",
"water_tech" = "#water_tech_category") %>%
mutate(status = replace_na(status, "Unknown"), water_tech = replace_na(water_tech, "Unknown")) %>%
select (water_tech,clean_adm2,status,lat,long,usage_capacity, is_urban)
Extracting Funtional, Non-Functional and Unknown water points
Extract water point records by using #status_clean column.
<- wp_sfT %>%
functional filter(`status` %in% c("Functional", "Functional but not in use" , "Functional but needs repair")) %>%
select(`lat`, `long`, `water_tech`, `clean_adm2`, `status`, `usage_capacity`, `is_urban`)
<- wp_sfT %>%
nonfunctional filter(`status` %in% c("Abandoned/Decommissioned", "Abandoned", "Non functional due to dry season", "Non-Functional", "Non-Functional due to dry season")) %>%
select(`lat`, `long`, `water_tech`, `clean_adm2`, `status`, `usage_capacity`, `is_urban`)
<- wp_sfT %>%
unknown_wp filter(`status` %in% c("Unknown")) %>%
select(`lat`, `long`, `water_tech`, `clean_adm2`, `status`, `usage_capacity`, `is_urban`)
<- wp_sfT %>%
handpump_count filter(`water_tech` %in% c("Hand Pump")) %>%
select(`lat`, `long`, `water_tech`, `clean_adm2`, `status`, `usage_capacity`, `is_urban`)
<- wp_sfT %>%
usageL1k filter(`usage_capacity` < 1000) %>%
select(`lat`, `long`, `water_tech`, `clean_adm2`, `status`, `usage_capacity`, `is_urban`)
<- wp_sfT %>%
usage1k filter(`usage_capacity` == 1000) %>%
select(`lat`, `long`, `water_tech`, `clean_adm2`, `status`, `usage_capacity`, `is_urban`)
<- wp_sfT %>%
ruralWP filter(`is_urban` == "FALSE") %>%
select(`lat`, `long`, `water_tech`, `clean_adm2`, `status`, `usage_capacity`, `is_urban`)
st_crs(nga)
Coordinate Reference System:
User input: EPSG:4326
wkt:
GEOGCRS["WGS 84",
ENSEMBLE["World Geodetic System 1984 ensemble",
MEMBER["World Geodetic System 1984 (Transit)"],
MEMBER["World Geodetic System 1984 (G730)"],
MEMBER["World Geodetic System 1984 (G873)"],
MEMBER["World Geodetic System 1984 (G1150)"],
MEMBER["World Geodetic System 1984 (G1674)"],
MEMBER["World Geodetic System 1984 (G1762)"],
MEMBER["World Geodetic System 1984 (G2139)"],
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ENSEMBLEACCURACY[2.0]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic latitude (Lat)",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["geodetic longitude (Lon)",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
USAGE[
SCOPE["Horizontal component of 3D system."],
AREA["World."],
BBOX[-90,-180,90,180]],
ID["EPSG",4326]]
st_crs(wp_sfT)
Coordinate Reference System:
User input: EPSG:4326
wkt:
GEOGCRS["WGS 84",
ENSEMBLE["World Geodetic System 1984 ensemble",
MEMBER["World Geodetic System 1984 (Transit)"],
MEMBER["World Geodetic System 1984 (G730)"],
MEMBER["World Geodetic System 1984 (G873)"],
MEMBER["World Geodetic System 1984 (G1150)"],
MEMBER["World Geodetic System 1984 (G1674)"],
MEMBER["World Geodetic System 1984 (G1762)"],
MEMBER["World Geodetic System 1984 (G2139)"],
ELLIPSOID["WGS 84",6378137,298.257223563,
LENGTHUNIT["metre",1]],
ENSEMBLEACCURACY[2.0]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
CS[ellipsoidal,2],
AXIS["geodetic latitude (Lat)",north,
ORDER[1],
ANGLEUNIT["degree",0.0174532925199433]],
AXIS["geodetic longitude (Lon)",east,
ORDER[2],
ANGLEUNIT["degree",0.0174532925199433]],
USAGE[
SCOPE["Horizontal component of 3D system."],
AREA["World."],
BBOX[-90,-180,90,180]],
ID["EPSG",4326]]
Below code chunks create new columns of total water point count, functional/non-functional, unknown water points count, high/low usage water point count and water point count in non-urban region by using st_intersects() function.
$WPCount <- lengths(st_intersects(nga, wp_sfT))
nga
$functional <- lengths(st_intersects(nga, functional))
nga
$nonfunctional <- lengths(st_intersects(nga, unknown_wp))
nga
$unknown_wp <- lengths(st_intersects(nga, nonfunctional))
nga
$handpump <- lengths(st_intersects(nga, handpump_count))
nga
$usage1k <- lengths(st_intersects(nga, usage1k))
nga
$usageL1k <- lengths(st_intersects(nga, usageL1k))
nga
$ruralWP <- lengths(st_intersects(nga, ruralWP)) nga
Also the ratios of functional/non-functional, unknown water points, high/low usage water point and water point in non-urban region are been created.
<- nga %>%
nga mutate(`pct_functional` = `functional`/`WPCount`) %>%
mutate(`pct_nonfunctional` = `nonfunctional`/`WPCount`) %>%
mutate(`pct_handpump` = `handpump`/`WPCount`) %>%
mutate(`pct_usage1k` = `usage1k`/`WPCount`) %>%
mutate(`pct_usageL1k` = `usageL1k`/`WPCount`) %>%
mutate(`pct_ruralWP` = `ruralWP`/`WPCount`)
Removing NA values
<- nga[-c(3, 86, 241, 250, 252, 261, 400, 406, 447, 473, 492, 507, 526),] nga
Replacing NA values
$`pct_functional`[is.na(nga$`pct_functional`)] <- 0
nga$`pct_nonfunctional`[is.na(nga$`pct_nonfunctional`)] <- 0
nga$`pct_handpump`[is.na(nga$`pct_handpump`)] <- 0
nga$`pct_usage1k`[is.na(nga$`pct_usage1k`)] <- 0
nga$`pct_usageL1k`[is.na(nga$`pct_usageL1k`)] <- 0
nga$`pct_ruralWP`[is.na(nga$`pct_ruralWP`)] <- 0 nga
Transform the coordinates from 4326 to 26391 projection using the st_transform() function.
<- st_transform(nga, crs = 26391)
nga_sf st_crs(nga_sf)
Coordinate Reference System:
User input: EPSG:26391
wkt:
PROJCRS["Minna / Nigeria West Belt",
BASEGEOGCRS["Minna",
DATUM["Minna",
ELLIPSOID["Clarke 1880 (RGS)",6378249.145,293.465,
LENGTHUNIT["metre",1]]],
PRIMEM["Greenwich",0,
ANGLEUNIT["degree",0.0174532925199433]],
ID["EPSG",4263]],
CONVERSION["Nigeria West Belt",
METHOD["Transverse Mercator",
ID["EPSG",9807]],
PARAMETER["Latitude of natural origin",4,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8801]],
PARAMETER["Longitude of natural origin",4.5,
ANGLEUNIT["degree",0.0174532925199433],
ID["EPSG",8802]],
PARAMETER["Scale factor at natural origin",0.99975,
SCALEUNIT["unity",1],
ID["EPSG",8805]],
PARAMETER["False easting",230738.26,
LENGTHUNIT["metre",1],
ID["EPSG",8806]],
PARAMETER["False northing",0,
LENGTHUNIT["metre",1],
ID["EPSG",8807]]],
CS[Cartesian,2],
AXIS["(E)",east,
ORDER[1],
LENGTHUNIT["metre",1]],
AXIS["(N)",north,
ORDER[2],
LENGTHUNIT["metre",1]],
USAGE[
SCOPE["Engineering survey, topographic mapping."],
AREA["Nigeria - onshore west of 6°30'E, onshore and offshore shelf."],
BBOX[3.57,2.69,13.9,6.5]],
ID["EPSG",26391]]
Exploratory Data Analysis
Bar-chart
In the code chunk below, freq() of funModeling package is used to display the distribution of status
, water_tech
, is_urbal
fields in wp_sfT
.
freq(data=wp_sfT,
input = 'status')
status frequency percentage cumulative_perc
1 Functional 45883 48.29 48.29
2 Non-Functional 29385 30.93 79.22
3 Unknown 10656 11.22 90.44
4 Functional but needs repair 4579 4.82 95.26
5 Non-Functional due to dry season 2403 2.53 97.79
6 Functional but not in use 1686 1.77 99.56
7 Abandoned/Decommissioned 234 0.25 99.81
8 Abandoned 175 0.18 99.99
9 Non functional due to dry season 7 0.01 100.00
Nigeria consists of almost 48% of functional, 30% of non-functional and 11% of unknown waterpoints.
freq(data=wp_sfT,
input = 'water_tech')
water_tech frequency percentage cumulative_perc
1 Hand Pump 58755 61.84 61.84
2 Mechanized Pump 25644 26.99 88.83
3 Unknown 10055 10.58 99.41
4 Tapstand 553 0.58 99.99
5 Rope and Bucket 1 0.00 100.00
Nigeria consists of about majority of 61% of hand pump, 26% of mechanized pump and 10% of unknown water technology.
freq(data=wp_sfT,
input = 'is_urban')
is_urban frequency percentage cumulative_perc
1 FALSE 75444 79.41 79.41
2 TRUE 19564 20.59 100.00
Nigeria consists of about majority of 79% of rural regions, 20% of urban regions.
Histogram
Histogram is useful to identify the overall distribution of the data values (i.e. left skew, right skew or normal distribution)
The distribution of waterpoint proportion attributes
The code chunks below are used to create multiple histograms to reveal the overall distribution of the selected variables in nga_wp_clus.
<- ggplot(data=nga_sf,
pct_functional aes(x= `pct_functional`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
<- ggplot(data=nga_sf,
pct_nonfunctional aes(x= `pct_nonfunctional`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
<- ggplot(data=nga_sf,
pct_handpump aes(x= `pct_handpump`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
<- ggplot(data=nga_sf,
pct_usageCap1k aes(x= `pct_usage1k`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
<- ggplot(data=nga_sf,
pct_usageCapLess1k aes(x= `pct_usageL1k`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
<- ggplot(data=nga_sf,
pct_ruralWP aes(x= `pct_ruralWP`)) +
geom_histogram(bins=20,
color="black",
fill="light blue")
ggarrange(pct_functional, pct_nonfunctional, pct_handpump, pct_usageCap1k, pct_usageCapLess1k, pct_ruralWP,
ncol = 3,
nrow = 2)
EDA using choropleth map
tm_shape(nga_sf) +
tm_polygons(c("pct_functional", "pct_nonfunctional", "pct_handpump","pct_usage1k","pct_usageL1k", "pct_ruralWP"),
style="jenks") +
tm_facets(sync = TRUE, ncol = 2, nrow = 3) +
tm_legend(legend.position = c("right", "bottom"), legend.title.size = 1.5,legend.text.size = 1)+
tm_layout(outer.margins=0, asp=0)
Correlation Analysis
Before we perform cluster analysis, it it important for us to ensure that the cluster variables are not highly correlated.
The code chunk below is used to visualise and analyse the correlation of the input variables.
str(nga_sf)
Classes 'sf' and 'data.frame': 761 obs. of 16 variables:
$ shapeName : chr "aba north" "aba south" "abaji" "abak" ...
$ geometry :sfc_MULTIPOLYGON of length 761; first list element: List of 1
..$ :List of 1
.. ..$ : num [1:71, 1:2] 552560 552452 552269 552016 551706 ...
..- attr(*, "class")= chr [1:3] "XY" "MULTIPOLYGON" "sfg"
$ WPCount : int 17 71 57 48 233 34 119 152 66 39 ...
$ functional : int 7 29 23 23 82 16 72 79 18 25 ...
$ nonfunctional : int 1 7 0 0 109 3 14 11 22 1 ...
$ unknown_wp : int 9 35 34 25 42 15 33 62 26 13 ...
$ handpump : int 2 7 23 4 102 5 20 91 1 12 ...
$ usage1k : int 14 62 34 44 22 26 84 50 43 26 ...
$ usageL1k : int 3 9 23 4 211 8 35 102 23 13 ...
$ ruralWP : int 0 4 48 40 204 7 0 145 48 21 ...
$ pct_functional : num 0.412 0.408 0.404 0.479 0.352 ...
$ pct_nonfunctional: num 0.0588 0.0986 0 0 0.4678 ...
$ pct_handpump : num 0.1176 0.0986 0.4035 0.0833 0.4378 ...
$ pct_usage1k : num 0.8235 0.8732 0.5965 0.9167 0.0944 ...
$ pct_usageL1k : num 0.1765 0.1268 0.4035 0.0833 0.9056 ...
$ pct_ruralWP : num 0 0.0563 0.8421 0.8333 0.8755 ...
- attr(*, "sf_column")= chr "geometry"
- attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA NA NA NA NA NA ...
..- attr(*, "names")= chr [1:15] "shapeName" "WPCount" "functional" "nonfunctional" ...
<- nga_sf %>%
nga_sf_var st_drop_geometry() %>%
select("shapeName", "functional","nonfunctional", "pct_functional", "pct_nonfunctional", "pct_handpump","pct_usage1k","pct_usageL1k", "pct_ruralWP")
= cor(nga_sf_var[,2:8])
cluster_vars.cor corrplot.mixed(cluster_vars.cor,
lower = "ellipse",
upper = "number",
tl.pos = "lt",
diag = "l",
tl.col = "blue")
Hierarchy Cluster Analysis
The code chunk below will be used to extract the clustering variables from the nga_sf simple feature object into data.frame.
<- nga_sf_var %>%
cluster_vars select("shapeName", "pct_functional", "pct_nonfunctional", "pct_handpump", "pct_usageL1k", "pct_ruralWP")
head(cluster_vars,10)
shapeName pct_functional pct_nonfunctional pct_handpump pct_usageL1k
1 aba north 0.4117647 0.05882353 0.11764706 0.17647059
2 aba south 0.4084507 0.09859155 0.09859155 0.12676056
4 abaji 0.4035088 0.00000000 0.40350877 0.40350877
5 abak 0.4791667 0.00000000 0.08333333 0.08333333
6 abakaliki 0.3519313 0.46781116 0.43776824 0.90557940
7 abeokuta north 0.4705882 0.08823529 0.14705882 0.23529412
8 abeokuta south 0.6050420 0.11764706 0.16806723 0.29411765
9 abi 0.5197368 0.07236842 0.59868421 0.67105263
10 aboh-mbaise 0.2727273 0.33333333 0.01515152 0.34848485
11 abua/odual 0.6410256 0.02564103 0.30769231 0.33333333
pct_ruralWP
1 0.00000000
2 0.05633803
4 0.84210526
5 0.83333333
6 0.87553648
7 0.20588235
8 0.00000000
9 0.95394737
10 0.72727273
11 0.53846154
= cor(cluster_vars[,2:6])
cluster_vars.cor corrplot.mixed(cluster_vars.cor,
lower = "ellipse",
upper = "number",
tl.pos = "lt",
diag = "l",
tl.col = "blue")
row.names(cluster_vars) <- cluster_vars$"shapeName"
head(cluster_vars,10)
shapeName pct_functional pct_nonfunctional pct_handpump
aba north aba north 0.4117647 0.05882353 0.11764706
aba south aba south 0.4084507 0.09859155 0.09859155
abaji abaji 0.4035088 0.00000000 0.40350877
abak abak 0.4791667 0.00000000 0.08333333
abakaliki abakaliki 0.3519313 0.46781116 0.43776824
abeokuta north abeokuta north 0.4705882 0.08823529 0.14705882
abeokuta south abeokuta south 0.6050420 0.11764706 0.16806723
abi abi 0.5197368 0.07236842 0.59868421
aboh-mbaise aboh-mbaise 0.2727273 0.33333333 0.01515152
abua/odual abua/odual 0.6410256 0.02564103 0.30769231
pct_usageL1k pct_ruralWP
aba north 0.17647059 0.00000000
aba south 0.12676056 0.05633803
abaji 0.40350877 0.84210526
abak 0.08333333 0.83333333
abakaliki 0.90557940 0.87553648
abeokuta north 0.23529412 0.20588235
abeokuta south 0.29411765 0.00000000
abi 0.67105263 0.95394737
aboh-mbaise 0.34848485 0.72727273
abua/odual 0.33333333 0.53846154
<- select(cluster_vars, c(2:6))
nga_cluster_var head(nga_cluster_var, 10)
pct_functional pct_nonfunctional pct_handpump pct_usageL1k
aba north 0.4117647 0.05882353 0.11764706 0.17647059
aba south 0.4084507 0.09859155 0.09859155 0.12676056
abaji 0.4035088 0.00000000 0.40350877 0.40350877
abak 0.4791667 0.00000000 0.08333333 0.08333333
abakaliki 0.3519313 0.46781116 0.43776824 0.90557940
abeokuta north 0.4705882 0.08823529 0.14705882 0.23529412
abeokuta south 0.6050420 0.11764706 0.16806723 0.29411765
abi 0.5197368 0.07236842 0.59868421 0.67105263
aboh-mbaise 0.2727273 0.33333333 0.01515152 0.34848485
abua/odual 0.6410256 0.02564103 0.30769231 0.33333333
pct_ruralWP
aba north 0.00000000
aba south 0.05633803
abaji 0.84210526
abak 0.83333333
abakaliki 0.87553648
abeokuta north 0.20588235
abeokuta south 0.00000000
abi 0.95394737
aboh-mbaise 0.72727273
abua/odual 0.53846154
Data Standardization Standardization as it can handle outliers well and it is applicable to variables which are from normal distribution Z-score
<- normalize(nga_cluster_var)
nga_cluster_var.std summary(nga_cluster_var.std)
pct_functional pct_nonfunctional pct_handpump pct_usageL1k
Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :0.0000
1st Qu.:0.3333 1st Qu.:0.0000 1st Qu.:0.1860 1st Qu.:0.4157
Median :0.4792 Median :0.0000 Median :0.5255 Median :0.6807
Mean :0.5070 Mean :0.1277 Mean :0.4956 Mean :0.6182
3rd Qu.:0.6749 3rd Qu.:0.2105 3rd Qu.:0.7857 3rd Qu.:0.8750
Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. :1.0000
pct_ruralWP
Min. :0.0000
1st Qu.:0.5922
Median :0.8717
Mean :0.7395
3rd Qu.:1.0000
Max. :1.0000
<- scale(nga_cluster_var)
nga_cluster_var.z describe(nga_cluster_var.z)
vars n mean sd median trimmed mad min max range skew
pct_functional 1 761 0 1 -0.12 -0.03 1.04 -2.16 2.10 4.25 0.21
pct_nonfunctional 2 761 0 1 -0.62 -0.22 0.00 -0.62 4.27 4.89 1.98
pct_handpump 3 761 0 1 0.09 0.01 1.34 -1.53 1.56 3.09 -0.11
pct_usageL1k 4 761 0 1 0.22 0.09 1.07 -2.13 1.31 3.44 -0.60
pct_ruralWP 5 761 0 1 0.42 0.17 0.60 -2.35 0.83 3.18 -1.17
kurtosis se
pct_functional -0.67 0.04
pct_nonfunctional 4.04 0.04
pct_handpump -1.31 0.04
pct_usageL1k -0.80 0.04
pct_ruralWP 0.10 0.04
Visualising the standardised clustering variables
Before performing clustering analysis, it is good to visualise the distribution of variables.
<- ggplot(data=nga_sf,
r aes(x= `pct_functional`)) +
geom_histogram(bins=20,
color="blue",
fill="#E69F00") +
ggtitle("Raw values without standardisation")
<- as.data.frame(nga_cluster_var.std)
nga_cluster_s_df <- ggplot(data=nga_cluster_s_df,
s aes(x=`pct_functional`)) +
geom_histogram(bins=20,
color="blue",
fill="#E69F00") +
ggtitle("Min-Max Standardisation")
<- as.data.frame(nga_cluster_var.z)
nga_cluster_z_df <- ggplot(data=nga_cluster_z_df,
z aes(x=`pct_functional`)) +
geom_histogram(bins=20,
color="blue",
fill="#E69F00") +
ggtitle("Z-score Standardisation")
ggarrange(r, s, z,
ncol = 3,
nrow = 1)
Computing Proximity Matrix
A proximity matrix is a square matrix (two-dimensional array) containing the distances, taken pairwise between the elements of a matrix. Broadly defined; a proximity matrix measures the similarity or dissimilarity between the pairs of matrix.
<- dist(nga_cluster_var, method = 'euclidean') proxmat
Computing hierarchical clustering
hclust() can be used to perform hierarchical clustering. It supports eight clustering algorithms: ward.D, ward.D2, single, complete, average(UPGMA), mcquitty(WPGMA), median(WPGMC) and centroid(UPGMC).
The code chunk below is used to compute hierarchical clustering and adopt centroid method.
<- hclust(proxmat, method = 'ward.D') hclust_ward
Selecting the optimal clustering algorithm by hierarchical clustering
agnes() function of cluster package cane calculate the agglomerative coefficient, which measures the amount of clustering structure.
<- c( "average", "single", "complete", "ward")
m names(m) <- c( "average", "single", "complete", "ward")
<- function(x) {
ac agnes(nga_cluster_var, method = x)$ac
}
map_dbl(m, ac)
average single complete ward
0.9320242 0.8841971 0.9557188 0.9930958
Determining Optimal Clusters
There are three ways to determine the optimal clusters.
Elbow method
Average Silhouette Method
Gap Statistic Method
We are going to adopt Gap Statistic Method to determine the optimal clusters. The gap statistic compares the total within intra-cluster variation for different values of k with their expected values under null reference distribution of the data. The estimate of the optimal clusters will be value that maximize the gap statistic. This means that the clustering structure is far away from the random uniform distribution of points.
set.seed(1234)
<- clusGap(nga_cluster_var,
gap_stat FUN = hcut,
nstart = 25,
K.max = 10,
B = 50)
# Print the result
print(gap_stat, method = "firstmax")
Clustering Gap statistic ["clusGap"] from call:
clusGap(x = nga_cluster_var, FUNcluster = hcut, K.max = 10, B = 50, nstart = 25)
B=50 simulated reference sets, k = 1..10; spaceH0="scaledPCA"
--> Number of clusters (method 'firstmax'): 5
logW E.logW gap SE.sim
[1,] 5.026463 5.468043 0.4415796 0.008773640
[2,] 4.784723 5.341867 0.5571439 0.013342806
[3,] 4.670536 5.257902 0.5873658 0.009107344
[4,] 4.588263 5.189318 0.6010551 0.009205690
[5,] 4.464084 5.136493 0.6724086 0.009032431
[6,] 4.420326 5.091754 0.6714280 0.008259398
[7,] 4.366689 5.051587 0.6848981 0.008286133
[8,] 4.333252 5.019084 0.6858326 0.008194347
[9,] 4.291669 4.993013 0.7013443 0.007706852
[10,] 4.251836 4.970142 0.7183058 0.007584662
Then, we can visualize the cluster plots.
fviz_gap_stat(gap_stat)
Interpreting the dendrogram
plot(hclust_ward,cex=0.6)
rect.hclust(hclust_ward,
k = 4,
border = 2:5)
Visually-driven hierarchical clustering analysis
We are going to build both interactive and static cluster heatmaps with the help of heatmaply() package.
Transforming the data frame into a matrix
The code chunk below is used to transform lga_ict data frame into a data matrix.
<- data.matrix(nga_cluster_var) nga_cluster_var_mat
Plotting Interactive Cluster Heatmap
heatmaply(normalize(nga_cluster_var_mat),
Colv=NA,
dist_method = "euclidean",
hclust_method = "ward.D",
seriate = "OLO",
colors = Reds,
k_row = 5,
margins = c(NA,200,60,NA),
fontsize_row = 4,
fontsize_col = 5,
main="Geographic Segmentation of Nigeria WP indicators",
xlab = "ICT Indicators",
ylab = "ShapeName"
)
Mapping the clusters formed
The code chunk below is used to derive a 5-cluster model based on hierarchical clustering.
<- as.factor(cutree(hclust_ward, k=5))
groups <- cbind(nga_sf, as.matrix(groups)) %>%
nga_sf_cluster rename(`CLUSTER`=`as.matrix.groups.`)
qtm(nga_sf_cluster, "CLUSTER")
Elbow Method
The Elbow Method looks at the total WSS(within-cluster sum of Square) as a function of the number clusters.
set.seed(1234)
# function to compute total within-cluster sum of squares
fviz_nbclust(nga_cluster_var, hcut, method = "wss", k.max = 10) + theme_minimal() + ggtitle("the Elbow Method")
Plot K-means
<- kmeans(nga_cluster_var,4,nstart = 25)
kmm fviz_cluster(kmm, data = nga_cluster_var,
geom = "point",
ellipse.type = "convex",
ggtheme = theme_bw()
)
Multivariate Visualisation
Past studies shown that parallel coordinate plot can be used to reveal clustering variables by cluster very effectively. In the code chunk below, ggparcoord()
of GGally package
ggparcoord(data = nga_sf_cluster,
columns = c(10:15),
groupColumn = "CLUSTER",
scale = "std",
alphaLines = 0.2,
boxplot = TRUE,
title = "Multiple Parallel Coordinates Plots of Nigeria Variables by Cluster") +
facet_grid(~ CLUSTER) +
theme(axis.text.x = element_text(angle = 30, size = 15)) +
scale_color_viridis(option = "C", discrete=TRUE)
In the code chunk below, group_by()
and summarise()
of dplyr are used to derive mean values of the clustering variables.
%>%
nga_sf_cluster st_set_geometry(NULL) %>%
group_by(CLUSTER) %>%
summarise(mean_pct_functional = mean(pct_functional),
mean_pct_nonfunctional = mean(pct_nonfunctional),
mean_pct_handpump = mean(pct_handpump),
mean_pct_usage1k = mean(pct_usage1k),
mean_pct_usageL1k = mean(pct_usageL1k),
mean_pct_ruralWP = mean(pct_ruralWP))
# A tibble: 5 × 7
CLUSTER mean_pct_functional mean_pct_nonfunc…¹ mean_…² mean_…³ mean_…⁴ mean_…⁵
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.571 0.0847 0.311 0.600 0.400 0.299
2 2 0.440 0.144 0.574 0.281 0.719 0.871
3 3 0.334 0.0426 0.134 0.809 0.191 0.943
4 4 0.195 0.575 0.113 0.407 0.593 0.567
5 5 0.750 0.00218 0.871 0.123 0.877 0.922
# … with abbreviated variable names ¹mean_pct_nonfunctional,
# ²mean_pct_handpump, ³mean_pct_usage1k, ⁴mean_pct_usageL1k,
# ⁵mean_pct_ruralWP
Computing spatially constrained clusters using SKATER method
The code chunk below is used to compute the spatially constrained clusters using as_Spatial() of poly2nd( ) packages.
<- as_Spatial(nga_sf) nga_sp
<- poly2nb(nga_sp, queen=TRUE)
nga.nb summary(nga.nb)
Neighbour list object:
Number of regions: 761
Number of nonzero links: 4348
Percentage nonzero weights: 0.750793
Average number of links: 5.713535
Link number distribution:
1 2 3 4 5 6 7 8 9 10 11 12 14
4 16 57 122 177 137 121 71 39 11 4 1 1
4 least connected regions:
138 509 525 560 with 1 link
1 most connected region:
508 with 14 links
<- poly2nb(nga_sp, queen=TRUE)
nga.nb summary(nga.nb)
Neighbour list object:
Number of regions: 761
Number of nonzero links: 4348
Percentage nonzero weights: 0.750793
Average number of links: 5.713535
Link number distribution:
1 2 3 4 5 6 7 8 9 10 11 12 14
4 16 57 122 177 137 121 71 39 11 4 1 1
4 least connected regions:
138 509 525 560 with 1 link
1 most connected region:
508 with 14 links
Computing minimum spanning tree
The code chunk below is used to compute the cost of each edge.
<- nbcosts(nga.nb, nga_cluster_var) lcosts
<- nb2listw(nga.nb,
nga.w
lcosts, style="B")
summary(nga.w)
Characteristics of weights list object:
Neighbour list object:
Number of regions: 761
Number of nonzero links: 4348
Percentage nonzero weights: 0.750793
Average number of links: 5.713535
Link number distribution:
1 2 3 4 5 6 7 8 9 10 11 12 14
4 16 57 122 177 137 121 71 39 11 4 1 1
4 least connected regions:
138 509 525 560 with 1 link
1 most connected region:
508 with 14 links
Weights style: B
Weights constants summary:
n nn S0 S1 S2
B 761 579121 1989.853 2574.866 26667.43
mstree() function of spdep packages can be used to compute minimum spanning tree.
<- mstree(nga.w) nga.mst
We can have a look at the dimension of the MST.
dim(nga.mst)
[1] 760 3
We can plot the MST to show the observation numbers of the nodes along with the LGA boundaries.
plot(nga_sp, border=gray(.5))
plot.mst(nga.mst,
coordinates(nga_sp),
col="blue",
cex.lab=0.7,
cex.circles=0.005,
add=TRUE)
Computing spatially constrained clusters using SKATER method
The code chunk below is used to compute the spatially constrained clusters using skater() of spdep packages.
<- spdep::skater(edges = nga.mst[,1:2],
clust4 data = nga_cluster_var,
method = "euclidean",
ncuts = 4)
str(clust4)
List of 8
$ groups : num [1:761] 2 2 5 2 4 3 3 4 2 3 ...
$ edges.groups:List of 5
..$ :List of 3
.. ..$ node: num [1:208] 479 760 442 166 95 94 137 404 699 103 ...
.. ..$ edge: num [1:207, 1:3] 406 253 130 377 108 467 414 432 692 223 ...
.. ..$ ssw : num 87.1
..$ :List of 3
.. ..$ node: num [1:124] 33 102 65 101 332 717 539 214 9 359 ...
.. ..$ edge: num [1:123, 1:3] 9 359 538 214 539 717 332 101 65 102 ...
.. ..$ ssw : num 62.8
..$ :List of 3
.. ..$ node: num [1:161] 363 27 585 310 311 545 194 559 334 573 ...
.. ..$ edge: num [1:160, 1:3] 311 310 545 194 559 334 573 199 14 567 ...
.. ..$ ssw : num 85.1
..$ :List of 3
.. ..$ node: num [1:75] 234 723 443 161 711 461 413 722 408 727 ...
.. ..$ edge: num [1:74, 1:3] 13 564 533 408 455 215 491 401 281 408 ...
.. ..$ ssw : num 19.6
..$ :List of 3
.. ..$ node: num [1:193] 66 499 395 121 23 514 651 761 466 480 ...
.. ..$ edge: num [1:192, 1:3] 66 499 395 121 537 23 627 118 298 544 ...
.. ..$ ssw : num 81.4
$ not.prune : NULL
$ candidates : int [1:5] 1 2 3 4 5
$ ssto : num 441
$ ssw : num [1:5] 441 409 364 350 336
$ crit : num [1:2] 1 Inf
$ vec.crit : num [1:761] 1 1 1 1 1 1 1 1 1 1 ...
- attr(*, "class")= chr "skater"
<- clust4$groups
ccs4 ccs4
[1] 2 2 5 2 4 3 3 4 2 3 5 5 4 3 5 3 4 2 5 4 3 2 5 2 3 3 3 3 5 3 3 1 2 3 1 3 5
[38] 5 3 5 2 3 5 5 5 1 5 3 1 3 3 3 2 2 2 3 3 5 3 4 3 1 5 5 2 5 3 1 5 5 3 3 5 1
[75] 4 2 2 2 3 1 3 1 5 1 1 5 1 1 1 4 5 5 4 1 1 1 1 1 1 4 2 2 1 5 1 1 5 1 1 5 5
[112] 1 1 5 1 4 4 5 3 3 5 5 5 3 1 5 1 1 5 1 4 3 5 2 2 5 1 1 5 1 1 1 1 1 1 5 5 5
[149] 1 1 1 1 1 1 3 3 1 1 1 5 4 1 2 1 1 1 2 4 5 5 5 5 5 3 3 3 5 3 2 5 3 2 5 5 5
[186] 5 2 3 2 5 2 2 2 3 3 3 3 3 3 2 2 2 3 3 3 2 2 3 3 3 4 3 2 2 4 4 1 5 1 1 1 1
[223] 1 5 1 5 1 1 1 1 1 1 1 4 4 1 5 4 1 1 1 5 2 1 1 5 5 1 4 1 5 5 1 1 1 5 5 1 5
[260] 1 1 4 4 1 1 1 1 1 3 3 3 3 3 3 3 3 3 3 2 2 4 2 3 5 2 2 2 2 3 5 3 5 5 5 5 5
[297] 5 5 5 3 5 3 2 2 2 3 2 2 3 3 3 3 5 5 2 3 3 1 2 3 3 5 5 4 2 3 2 2 3 3 4 2 5
[334] 3 5 5 5 5 5 5 5 5 3 1 2 3 3 3 5 5 5 3 5 5 3 4 4 2 2 2 5 2 3 3 3 2 1 3 2 2
[371] 3 3 4 5 1 1 1 1 1 5 5 1 1 4 4 4 5 1 5 5 5 1 1 5 5 1 3 1 5 1 4 1 4 1 1 1 1
[408] 4 5 1 5 1 4 1 5 1 5 1 4 5 5 2 1 1 1 1 1 5 5 5 3 1 4 5 3 1 1 5 1 1 1 1 4 1
[445] 5 1 4 5 1 5 3 3 3 1 4 4 5 1 5 1 4 5 1 1 1 5 1 1 1 1 1 5 1 5 1 1 4 1 1 5 1
[482] 5 5 1 1 1 2 2 1 1 4 1 1 2 5 5 1 5 5 1 1 1 3 5 1 1 1 5 5 3 3 3 1 5 2 1 1 2
[519] 2 4 2 2 2 2 2 2 2 2 1 2 3 4 4 4 2 3 5 2 2 4 4 3 3 5 3 2 5 3 3 2 3 3 5 5 5
[556] 4 3 2 3 2 2 2 4 4 4 2 3 4 5 5 5 2 3 2 3 4 3 3 3 5 3 3 3 2 3 5 5 4 2 3 2 2
[593] 3 3 3 5 5 2 5 2 2 2 2 2 2 2 3 3 3 3 2 5 4 3 3 3 3 2 2 2 5 5 2 3 3 3 5 5 4
[630] 3 5 3 1 4 5 5 1 3 5 1 1 1 4 1 1 5 5 5 1 3 5 3 3 1 5 3 4 3 5 5 1 1 1 4 1 1
[667] 5 1 3 5 1 5 5 1 3 1 5 1 5 3 5 5 1 2 1 4 5 5 5 1 4 1 1 1 1 5 1 1 1 1 5 2 2
[704] 3 2 3 3 2 3 2 4 2 2 3 2 2 2 1 2 2 2 4 4 3 2 2 4 5 5 1 1 3 3 3 5 4 1 4 5 5
[741] 5 5 5 4 4 1 1 5 3 1 1 1 1 1 1 1 1 5 1 1 5
we can also plot the pruned tree that shows the five clusters on top of the admin 2 area.
plot(nga_sp, border=gray(.5))
plot(clust4,
coordinates(nga_sp),
cex.lab=.7,
groups.colors=c("blue","red","brown", "green", "pink"),
cex.circles=0.005,
add=TRUE)
Visualising the clusters in choropleth map
The code chunk below is used to plot the clusters derived by SKATER method.
<- as.matrix(clust4$groups)
groups_mat <- cbind(nga_sf_cluster, as.factor(groups_mat)) %>%
nga_sf_spatialcluster rename(`SP_CLUSTER`=`as.factor.groups_mat.`)
qtm(nga_sf_spatialcluster, "SP_CLUSTER")
In the code chunk below, group_by()
and summarise()
of dplyr are used to derive mean values of the clustering variables.
%>%
nga_sf_spatialcluster st_set_geometry(NULL) %>%
group_by(SP_CLUSTER) %>%
summarise(mean_pct_functional = mean(pct_functional),
mean_pct_nonfunctional = mean(pct_nonfunctional),
mean_pct_handpump = mean(pct_handpump),
mean_pct_usage1k = mean(pct_usage1k),
mean_pct_usageL1k = mean(pct_usageL1k),
mean_pct_ruralWP = mean(pct_ruralWP))
# A tibble: 5 × 7
SP_CLUSTER mean_pct_functional mean_pct_nonf…¹ mean_…² mean_…³ mean_…⁴ mean_…⁵
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.715 0.0427 0.778 0.215 0.785 0.806
2 2 0.307 0.282 0.145 0.572 0.428 0.698
3 3 0.449 0.150 0.218 0.615 0.385 0.619
4 4 0.361 0.266 0.590 0.146 0.854 0.921
5 5 0.516 0.0474 0.612 0.337 0.663 0.725
# … with abbreviated variable names ¹mean_pct_nonfunctional,
# ²mean_pct_handpump, ³mean_pct_usage1k, ⁴mean_pct_usageL1k,
# ⁵mean_pct_ruralWP
Place both the hierarchical clustering and spatially constrained hierarchical clustering maps next to each other.
<- qtm(nga_sf_cluster,
hclust.map "CLUSTER", title = "Hierarchical clustering") +
tm_borders(alpha = 0.5)
<- qtm(nga_sf_spatialcluster,
shclust.map "SP_CLUSTER", title = "spatially constrained clusters using SKATER method") +
tm_borders(alpha = 0.5)
tmap_arrange(hclust.map, shclust.map,
asp=NA, ncol=2)
Spatially Constrained Clustering: ClustGeo Method
In this section, we are going to use ClustGeo package to perform non-spatially constrained hierarchical cluster analysis and spatially constrained cluster analysis.
Ward-like hierarchical clustering: ClustGeo
To perform non-spatially constrained hierarchical cluster, we only need to provide the function a dissimilarity matrix as shown in the code chunk below.
<- hclustgeo(proxmat)
nongeo_cluster plot(nongeo_cluster, cex = 0.5)
rect.hclust(nongeo_cluster,
k = 5,
border = 2:5)
Mapping the clusters formed
<- as.factor(cutree(nongeo_cluster, k=5))
groups <- cbind(nga_sf, as.matrix(groups)) %>%
nga_sf_ngeo_cluster rename(`CLUSTER` = `as.matrix.groups.`)
qtm(nga_sf_ngeo_cluster, "CLUSTER")
Spatially Constrained Clustering
A spatial distance matrix will be derived by using st_distance()
of sf package before we perform spatially constrained hierarchical clustering.
<- st_distance(nga_sf, nga_sf)
dist <- as.dist(dist) distmat
<- choicealpha(proxmat, distmat, range.alpha = seq(0, 1, 0.1), K=5, graph = TRUE) cr
<- hclustgeo(proxmat, distmat, alpha = 0.4)
clustG <- as.factor(cutree(clustG, k=5))
groups <- cbind(nga_sf, as.matrix(groups)) %>%
nga_sf_Gcluster rename(`CLUSTER` = `as.matrix.groups.`)
qtm(nga_sf_Gcluster, "CLUSTER")
In the code chunk below, group_by()
and summarise()
of dplyr are used to derive mean values of the clustering variables.
%>%
nga_sf_Gcluster st_set_geometry(NULL) %>%
group_by(CLUSTER) %>%
summarise(mean_pct_functional = mean(pct_functional),
mean_pct_nonfunctional = mean(pct_nonfunctional),
mean_pct_handpump = mean(pct_handpump),
mean_pct_usage1k = mean(pct_usage1k),
mean_pct_usageL1k = mean(pct_usageL1k),
mean_pct_ruralWP = mean(pct_ruralWP))
# A tibble: 5 × 7
CLUSTER mean_pct_functional mean_pct_nonfunc…¹ mean_…² mean_…³ mean_…⁴ mean_…⁵
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 0.540 0.117 0.363 0.509 0.491 0.191
2 2 0.344 0.200 0.147 0.643 0.357 0.803
3 3 0.391 0.216 0.542 0.242 0.758 0.893
4 4 0.683 0.00508 0.807 0.184 0.816 0.892
5 5 0.603 0.117 0.622 0.344 0.656 0.713
# … with abbreviated variable names ¹mean_pct_nonfunctional,
# ²mean_pct_handpump, ³mean_pct_usage1k, ⁴mean_pct_usageL1k,
# ⁵mean_pct_ruralWP
<- qtm(nga_sf_ngeo_cluster,
ngeoclust.map "CLUSTER", title = "Ward-like hierarchical clustering") +
tm_borders(alpha = 0.5)
<- qtm(nga_sf_Gcluster,
gcluster.map "CLUSTER", title = "Spatially Constrained Hierarchical Clustering") +
tm_borders(alpha = 0.5)
tmap_arrange(ngeoclust.map, gcluster.map,
asp=NA, ncol=2)
Visualisation on all clustering
The code chunk below is used to plot the clusters result together.
tmap_arrange(hclust.map, shclust.map, ngeoclust.map, gcluster.map, ncol = 2, asp = 1)
THE END.